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Abstract 

For the normal linear model regression setup, Zellner's (^-prior is 
extended for the case where the number of predictors p exceeds the 
number of observations n. Exact analytical calculation of the marginal 
density under this prior is seen to lead to a new closed form variable 
selection criterion. This results are also applicable to the multivariate 
regression setup. 
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1 Introduction 

Suppose the normal linear regression model is used to relate Y to the poten- 
tial predictors Xi, . . . , Xp, 

Y ^ N^{aln + X(3,a^I) (1) 

where a is an intercept parameter, 1„ is an n x 1 vector each component 
of which is one, X = {Xi, . . . , Xp) is an n x p design matrix, /? is a p x 1 
vector of unknown regression coefficients, and o"^ is an unknown positive 
scalar. We assume that rank X = min(n,p) and also that X is in advance 
standardized so that (Xj — Xjl„)'(Xj — Xjl„) for 1 < i < p is equal to each 
other. The variable selection problem is important because there is usually 
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some unknown subset of the useless predictors and it should be preferable 
to ignore them. It would be convenient throughout to index each of these TP 
possible subset choices by the vector 

7 = (7i,---,7p) 

where 7^ = or 1. We use = 7'lp to denote the size of the 7th subset. 
The problem then becomes that of selecting a submodel of ([T]) of the form 

Py{Y\a, P^, a^, 7) = Nnialn + a^I) (2) 

where is the n x matrix whose columns corresponds to the 7th subset 
of Xi, . . . ,Xp, is a X 1 vector of unknown regression coefficients. Let 
denote M.^ the submodel given by ([2]). 

In this paper, we consider variable selection in the Bayesian framework. 
If the prior probability of any model is equal to each other, that is, 

Ft{M^) = Pr(A^y) for any 7 7^ 7', 

we choose A^^ as the best model which maximizes the posterior probability 
of given y 

where m^{y) is the marginal density of y under Ai^. In the linear model 
a, and are unknown parameters and eventually we will give prior 
distributions for all of them, which means full Bayes method. First of all, 
we consider prior distributions for a and j3y. As the prior measure of a, the 
Lebesgue measure 

Pa{a) = 1 (4) 

is used, which leads that the level of predictive values y is not shrunk av- 
eragely. There are a lot of papers which discuss the way to use improper 
priors like the Lebesgue measure because non-informative (objective) priors 
are usually improper. The one of the main meeting point is that the improper 
prior for model specific parameters is not allowed to use in the context of 
Bayesian model selection since improper priors are determined only up to an 
arbitrary multiplicative constant. On the other hand, if parameter appear 
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in all models entertained, the improper priors can be allowed as (jl]) for the 
intercept parameter a. We basically conform the rule, but as a matter of 
fact, all improper priors in this paper are expressed as limits of sequences of 
proper priors, which never causes trouble. For example, we use a sequence 
of normal distribution for a 



N{0, hi 



(5) 



where ha ^ oo, instead of using (jlj) directly. 

Since we consider the normal linear regression model, the most tractable 
prior distribution of ^ is normal con jugate. Among a class of normal con- 
jugate priors, so-called IZellnerl (Il986l )'s ^f-prior 



(6) 



where is a centered matrix of by su btracting the corre s pond i ng mean 



from e a ch of them, i s extre mely often used (iGeorge and Fosterl (120001 ) ; iFernandez et al, 
fl200lh : lLiang et al.l tOO^ ). Since includes the inverse of Z'^Z^, it is ap- 
plicable only for the traditional situation p < n — 1, which means < n — 1 
for any A4^. As shown in Section [21 the marginal density of y given g and 
cr^ under Ai^, which is denoted by m^{y\g, a^), is given by 



oc 



exp ( \ msix\ogp{Y\a, f3^,a'^) - ^^-^log{g + 1) 



(7) 



when the priors (ISl) with hg, = oo a nd (jH]) are used. Actually ([7]) is the key 
relation in lGeorge and Fosterl (120001 ). Hence if the variance is known and 



we choose g independently of y which satisfies 



9+1 
9 



log{g + 1) = 2, or logn 



the Bayesia n strateg y using (^^) exactly corresponds to AIC by lAkaikd (1l974l ) 
or BIG by ISchwara (1l978l ). This equivalence is the weighty reason that 
shrinkage prior of f5 given by ([6]) is useful for selecting the best model 
in the Bayesian framework. A wide variety of related choices for penalty 
term i(q + l)/q|l o g(q -f - 1) have been proposed by others. For example, 
George and Fosterl (12000! ) use the empirical Bayes method to choose g de- 
pendently of y. To be more precise, they recommend that the maximizer of 



3 



the marginal likelihood of y 



g = argmax^ ^ m{y\g, a^, 7) 



(8) 



should be used as the estimator of g. 

In maximizing ([7]), the variance cx^ i s assumed to be kriown w hereas it is 



usually unknown in the real problems. iGeorge and Fosterl (120001 ) insert the 
unbiased estimator of variance based on the full model given by ([1]) or the 
submodel after deriving the criterion ([7]). In rigorous Bayesian point of 
view, however, if the variance is an unknown parameter, the prior distribution 
should be given. Furthermore we would like to assume the prior distribution 
of g instead of estimating a fixed g by empirical Bayes method. This is one 
of our motivation of this paper. See Section 2 for the detail of the priors of 
0"^ and g. 

Additionally treating (very) many regressors case {p > n — 1) becomes 
more and more important in modern statistics. Since the residual sum of 
squares (RSS) is zero in the case where > n — 1, naive AIC and BIG 
methods do not work. Even worse, when > n — 1, where the covariance 
matrix in Zellner's q-prior, ( Z'^Z^ )~^, does not exist . Tha t is the reason why 
neither George and Fosteii ( 20001 ) nor Liang et al. ( 2008) work. In such a 
problem, non- Bayesian regural i zation methods (Lasso bv iTibshiranil (119961) . 
Elastic-net by Zou and Hastie ( 2005 ) and their variants by Zou ( 20061 )) are 
being considered as standard methods. These are however based on numer- 
ical technique, so we cannot clearly see what is happening in terms of the 
fundamental aggregated of data as RSS. 

In this paper, we consider full Bayes method which is applicable for any 
case {p > n — 1 and p < n — 1). Needless to say, full marginal density 
is expressed by multiple integration and so has been always calculated by 
numerical method like MCMC. Notice here that even if normal conjugate 
prior for f3 including g-pnoT and inverse-gamma conjugate prior for cr^ are 
used, there is still the integral as long as the full Bayes marginal density is 
considered. Such numerical full Bayes methods cannot clearly show us what 
is happening in terms of the aggregated data. This is the motivation of this 
paper and actually we find a special variant of Zellner's g-phoi which enables 
us to not only calculate the marginal density analytically but also treat many 
regressors case. 

In Section [21 we will give a special variant of g-pnoi which is expressed 
by the singular value decomposition of and propose propose the Bayesian 
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criterion as follows: 

BC[A^^] (9) 

r - ~1 -n+l 

{sv[Z,]x||/?^^^[7]||} ifg,>n-l 



2N-l/4-g^/2 



{min.sv[Z^]}''^sv[Z^]-'?^(RSS^ + {min.svfZ^]}^!!/?^^^ 

vl'RQC; '|-(n-q-,)/2+3/4 g(g7/2+l/4,(n-g^)/2-3/4) 
Al^jaOO^; B(l/4,{n-gT,)/2-3/4) 

if < n — 2. 

Here || ■ || denotes the Euclid norm, RSS^ is the residual sum of squares under 
Ai.y, sv[Z^] and min.sv[Z^] are the geometric average and minimum value of 
the singular values of Z^, respectively. And also Plse[i] for q-y < n — 2 is 
the usual least squares estimator, Pls^l'j] for > n — 1 is the least squares 
estimator using the Moore-Pennrose inverse matrix. The submodel, which 
maximizes BC[A1^] among the class of candidates, is chosen. 

The most advantage of our criterion BC[A4-y] is that it is a beautiful closed 
form based on fundamental aggregated information of data. In particular, 
the interpretation is interesting for many regressors case. PffsEll] is the 
minimizer of among j3 which satisfies the equation y — yl^ = Z^j3 under 
each submodel A^-y. That is why (3f^sE[l] is the default estimator when 
^7 > n— 1, but ||/3l5£;[7]II itself is not comparable beyond J^^. The expression 
of the criterion ^ shows that sv[Zy] x ||/3^5£;[7]|| is comparable beyond My, 
that is, the smallest means the best among the submodel M.^ which satisfies 
> n — 1. 

In Section [3l we discuss the choice of hyper-parameters which appears 
in the variant of (^-priors. In Section HI the estimation after selection and 
Bayesian model averaging are discussed. In sectional we show that BC[7W^] 
has consistency for model selection as n ^ oo. 



2 The marginal density under full Bayes 

In this section, we derive the marginal density under full Bayesian setting. 
2.1 prior distribution 

First of all, we give the variant of Zellner's gf-prior which is suitable for any 
case [p < n and p > n). For our purpose, we will make the use of the singular 
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value decomposition of Z, 



(10) 



1=1 



where = min(g^, n — 1). Notice that the n — 1 is from the fact that Z^ is 
the centered matrix. In (fTOj) . and are n x and x orthogonal 
matrices, with the columns of U^, ('Ui[7], . . . , ^^[7]), spanning the column 
space of Z^, and the columns of W.y, (?i;i[7], . . . , ^^[7]), spanning the row 
space. is an x diagonal matrix whose diagonal components satisfy 
dill] ^ ■ ■ ■ ^ drlj] > . In the following, "gamma" 7 is omitted 
D^, W^, Ui[l], di['j], Wi['j], r.y and q^, that is, we just write Z, /5, U, D, W, 
Ui, di, Wi, r and q for them. Note that on the expression of ffTU]) . the signs 
of Ui, . . . ,Ur and Wi, . . . ,Wr are indeterminate whereas the signs of Uiw[ are 
determinate. However we will see that these indeterminations do not cause 
troubles in this paper. 

Then the variant of Zellner's g-prioT density of (3 is given by 



where Pi{-\g,cr'^) is the probability density of N{0, a'^d^'^{i'i + Uig — 1}) for 
1 < 2 < r. And also W^^ for g > r is a g x {q — r) matrix which makes 
the q X q matrix Wq = (W,W^) orthogonal and p#(W^P) is an arbitrary 
probability density with mean 0. Also z/j for 1 < z < r should be greater 
than 1 because of the non-negativeness of the variance. Eventually we will 
see that the descending order ui > ■ ■ ■ > Ur is reasonable for our purpose. 
Among a class of descending u, our recommendable choice is 



for 1 < i < r, but in the following, we do not specify Ui = df/d^ but just 
assume that i^i > 1 for any i. See Section H] for the detail. When r = q which 
implies that (Z'Z)^^ exists, ( ITT|) reduces to the original Zellner's prior given 
by © if z/i = ■ ■ ■ = I/, = 1. 

In addition to priors for a and given by and (ITT]) , it is assumed 
that 0"^ and g have prior distribution 




(11) 



Ui = dl/dl 



(12) 




(13) 



6 



g\l + g)—''-^ 
B{a + l,b+l)' 



'{0,00 



)i9) 



(14) 



where a>— 1,6>— 1 and > 1. It is easy to see that f|T3l) is a sequence 
of proper prior densities approaching an improper prior (cr^)~^/(o,oo)( 



as 



oo. Note that, almost all papers in earlier studies recommend that 



the density (cr^)~^ should be chosen in Bayesian model selection context 
because it has invariance to scale and location transformations. The prior 
distribution of g, (IT^ . is what is called Pearson Type VI distribution or 
beta-prime distribution. Actually the name "beta-prime" is from the fact 
that 1/(1 + (?) has the Beta distribution Be{a + 1,6+1). 

Originally Zellner's gf-prior did not assume that g has a distribution like 
( HM . There are several papers about the way to deterniine a fixed g which 
depends on p and/o r n. See Section 2 . 4 oflLiang et al.l (120081 ) for the detail. 
As far as we know, Zellner and Siowl ( 1980 ) implicitly introduced the prior 
distribution of g because their p{(3\(j'^) is multivariate Cauchy. In other words, 
their Pg{g) corresponds to inverse Gamma prior 

(n/2)^/2|p^y2)}-i(7"'/V"/(25). 



Liang et al.l (120081 ) explicitly introduced the distribution for g which they call 
hyper-^f priors. They just considered a subclass of the beta-prime distribution 



pig) 



(15) 



but we will see that the generalization of (|T5l) like iHM makes the class of 
marginal densities rich. Actually, in the estimation of a multivariate normal 
mean, equiva l ent pr iors to (jl]), (fT5|) and (fUD. has been considered b y 
StrawdermanI (Il97ll ) and extended by iMaruyama and StrawdermanI (120051 ) . 



2.2 marginal density 

The marginal density of y under TW^, is given by 

••her poo 



m^{y) 



Py{y\a,(3,a )pa{a)pp{(3\g,o- 

-oo J m Jh^^ Jo 

2\„ /„N J„ J/3 J/_2\ 



X p^2(a )pg{g)dad(3 d{a )dg 



ha 



h-^ Jo 



m^{y\9, o-^)Pa^ {(^'^)Pg{9) d{(T^) dg, 



(16) 
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where mj{y\g^ a^) is the marginal density given g and cr^, 

/OO P 
/ Py{y\oi, P, a^)pa{a)p(3{P\g, cr^)da dp. 
-OO J R1 

By the simple relation 

y — aln — Xp — {—a + y — x'P)ln + v — Zp 

where y means the mean of y and v = y — we have the Pythagorean 
relation, 

\\y - aln - XpW^ = n{-a + y- x' pf + \\v - Zpf. 
Then m-y(?/|g', cr^), the marginal density given g and o"^, is 

Py{y\a, P, cr^)Pa{a)Pf3{P\g: cr^)da dp 
(27r)-i/2 1 / n{a-y + x'Pf \\v - ZP\\' 



OO J m 



K (27ra2)-/2^''P V 2(72 



«2 



X exp (^-^ J vMg, (y ) da dp. 
By making the orthogonal transformation 



it satisfies 



P^W'P^ P,, if g = r, 



lim hamj{y\g,a'^ 

hex — ^oo 



roc . (2^)-i/2 / n{a-y + x'py \\v - Zpr \ 

X ppiPlg, cr^) da dp 
(27r)-V2nV2 |^(ff)|V2 /• / lk - ^miP _ P'Mg)P. \ 

(27r(72)(-l)/2 (27r(72)'-/2 '^''I' 1^ 2(72 2(72 J 



(= 1) ifq> 



r 



1 if g = r 

where "^Ig) = diag((i^(z/i + uig — . . . ,d'^{ur + ^rQ — 1)"^)- The first 
equation of f|T7|) is from the monotone convergence theorem. Completing 
square with respect to /3* as 

\\v-UD(3,f + (3',m(3, 

= {13, - (D^ + ^>)-^D'U'v)\D^ + - (D^ + ^>)-^D'U'v) (18) 

- v'UD{D^ + + t;'f , 

and noting that 



and also 



-v'UD{D^ + ^y^D'U'v + v'v = -v' ( ^Uiu[ J^^^ j v + t;'?; 



. 1=1 

r 



q ^ — ' 

^ i=i * 



we have 



lim ham^{y\g,a'^) 

/let— ►OO 

exp — (19) 



(27ra2)(n-i)/2 ^,1/2 ^ ^ 2(t2 

where 

RSS^ = - J^^''^^ ^ESS-, = ^ 

i=l i=l 

RSS^ and GESS^ means the residual sum of squares and "generalized" ex- 
plained sum of squares, respectively, li ui = ■ ■ ■ = Ur = I, GESS-y reduces to 
the normal ESS-y, ^l=i{u'jVy. 
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Remark 1: If r = g (that is, g < ri — 1) and Ui = ■ ■ ■ = Uj. = 1, we have 



hm ham^{y\g,a ) 



(27r)-VV/% f RSS, + (^? + l)-iESS, 

'1 + g) ^' exp ' 



^ ^ exp - \ ^ iog(l + (^) 



(27r(T2)("-i)/2 2(^ + l)a2 2(^ + 1)(t2 2 

« exp \max\ogp{Y\a,(3^,a^) - \og{g + V 

\g + i { a,0^ Zg 



where RSS^ + ESS^ = v'v. This is the key relation in iGeorge and Foster 
( 120001 ). where they discussed the relationship between the Bayesian criterion 
under Zellner's prior, AIC and BIG when the variance is known. 



Using ( 1191) . we have 
lim lim haGcr{hfj)m^{y) 

ha — ^OO ha — *00 



^ exp (-^^^^P^J^^^) d[a^) dg (20) 

-^^72 + gV"^^-'-'^" {{g + i)RSS, + GESS,}-("-^)/V.(^) dg, 



where 



GM= I \(y^r^ da^ = 2\ogK 



and 



nV^r({n-l}/2) 

^(^) = ^^v^^TT^ • ^^^^ 

The second equality in (!20!) is from the monotone convergence theorem. 
If r = n — 1, RSS^ = and hence we have 



n-l 

lim lim /i„G,(/i,)m-^(i/) =K(r^)TTz/^^/2(GESS^)"^""^^/^ (22) 



1=1 
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Hence in this case, m^{y) does not depend on the prior density of g. Notice 
also that ii ui = ■ ■ ■ = z/„_i = 1, GESS-y just becomes v'v = {y—ylnYiy—yln) 
and so fl22l) does not work for model selection because it always takes the 
same value when r = n — 1. That is why the choice of u is important. 

On the other hand, we consider q < n ~ 2, which implies that r = q. Let 
— 1 < a < —1/2 and b = (n — 5)/2 — r/2 — a, where b is always strictly greater 
than —1 for any Ai^ which satisfies q^ < n — 2. Then we have 

lim lim haG„{ha-)m^{y) 



K{n) 



oo 

g\l+g)—'-^(l + g)-l/^+(^-'y^ 



UU^y"B{a + l,b + l) Jo 
X {{g + 1)RSS-^ + GESS^}~^"~^^/^ dg (23) 

9 { T.OO , ^T^co 9 + 1] dg 



m=i^y'B{a + l,b+l) Jo VRSS, + GESS, 
_ K{n){RSS^ + GESS-y)-("-^)/^+^+^ B{q/2 + a + 1, 6 + 1) 
~ riLi^^y'lRSS^j^+i 5(a + l,6 + l) 

Note that the marginal density of the null model, AiN, can be obtained as a 
special case of expressions fl24|) with g = and GESS^ = 0, 

lim lim ho,G„{h„)mj^{y) = K{n){v'vy^''-^'>/^ . 

/la— >00 hjT— *00 

Remark 2: When 6 ^ (n — 5)/2 — r/2 — a, the integral (!23|) can be 



expressed by hyper-geometric function. iLiang et al.l (120081 ) deal with the 
case where 6 = and approximate the hyper-geometric function by using 
the Laplace method. On the other hand, our marginal density given by (l22il 
and (12^ is more desirable because it include neither the integral and infinite 
series. 

In summary, we have a following result on the marginal density under 

Theorem 2.1. Assume that prior distributions of a, /3-^|(cr^, g) and o"^ are 
given by ([5]) with ha = oo and ( fTTl) . (|T3l) with h^ = oo. When q^ > n — 1, the 
prior distribution of g is arbitrary. When q-y < n — 2, the prior distribution 
of g is given by f|T4l) with — 1 < a < —1/2 and b = (ra — 5)/2 — q^/2 — a. 
Then the marginal density of y given by f|T6l) satisfies 

lim lim haG^{h„)m^{y) = K{n)MD[M^] 
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(TVID is from Marginal Density) where K{n) is given by (^T^ and 

(25) 

where z/j[7] > 1 for any i, RSS-y is the residual sum of squares under M.^ and 
GESSy = — - — , for = min(g^, n — 1). 

'1=1 

Under assumptions of Theorem 12. 11 the posterior probabihty of A^-y given 
y is well-defined as 

Pr(A^^|y)= hm hm - l™'^— l™/^.-- ^"^-(^-)"^7(l/) 



Hence we have a following result. 

Corollary 2.1. As a Bayesian model selection criterion, it is justified that 
the submodel M.^ which maximizes MD[M.y] given by (125|1 among the class 
of candidate models is chosen. 

At this point, a has not yet been fixed while a has to be between (—1,-1/2) 
for well-defining the marginal density. As the default choice of a, we recom- 
mend 

a = -3/4 (26) 

which is the median of (—1,-1/2). In the next section, we consider the 
relationship between the choice of a and the asymptotic behavior of the 
prior density of f3. As for i/, as mentioned in (1121) . Ui = df/d^ for 1 <i <r 
should be recommended. In this case, we have 



GESS-y — c?^ ^ ^ — — — 
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E 



WiU'iV 



dj 



dlWtfEf ifr = n-l, 



(27) 



r 



1/2 



dl{sv[Z^]}-\ 



(28) 



i=l 



When we put together ([26]), ([27]) and ([28]) in MD[A^^] given bv fl25]) . we 
write it BC[A^^] (Bayesian Criterion) as our recommendation: 



This is very interesting because it consists of fundamental aggregated infor- 
mation of original data. 

2.3 The geometric mean of the singular values 

The property of geometric mean of the singular values given by ([28]) is not 
well-known. Hence, in this subsection, we investigate the behavior when the 
new explanatory variable is added to a centered design matrix. 

Theorem 2.2. Fix an nx q centered matrix Z where the rank Z = min(n — 
1, q) and assume that annx 1 vector z which satisfies z'ln = is added from 
the right hand side. Let Z^, be an nx (q + 1) matrix {Z, z) and RSS{z; Z) the 
residual sum of squares of the regression z Zh when q < n — 2. Then 



if q > n — 1. 

We see that when q < n — 2, sv[Z^] may become smaller than sv[Z] if the 
new added variable z is nearly linear dependent. On the other hand, when 
q > n — 1, sv[Z^] always becomes larger. 




sv[Z^] 



{-sv[Z]}'^/^'i+^^{RSSiz; Z)} 



l/(2q+2) 



if q < n — 2 and 



sv[Z^:] > sv[Z] 
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Proof. When q < n — 2, both Z'Z and Z'^Z^, are non-singular. Since Z^Z^ is 
expressed partitioned matrix as 



Z'^Z^, 



Z'Z Z'z 

z' Z z'z 



we have 

\Z'^Z,\ = {z'z - z'Z{Z'Z)-^Z'z}\Z'Z\ = ||(/ - Z{Z'Z)-^Z')zf\Z'Z\, 

where | ■ | denotes the determinant of the non-singular matrix. Since \Z'^Z^\ = 
sv[Z^]'^~^^ and \Z'Z\ = sv[ZY, we have completed the proof. 

Next we consider the case q > n — 1, which means that the number 
of positive eigenvalues of Z'^^Z^, and Z'Z is n — 1. Let Xi{Z'^Z^) > ■ ■ ■ > 
\n-\{Z'^Z^ and Xi{Z'Z) > ■ ■ ■ > Xn-i{Z'Z) be the eigenv alues o f Z'^Z ^: and 
Z'Z, respectively. By the Sturmian separation theorem (IRaol (119731 )). we 
have 

\iZ',Z,) > \{Z'Z) 
for all 1 < i < n — 1, which guarantees that 



sv 



„_1 ^ l/(2{n-l}) N l/(2{n-l}) 

[z^] = { n HKz.) \ >\\{ Hz'z) \ = sv[z]. 



i=l J I 1=1 



□ 



3 The choice of hyper parameters of an ex- 
tending ^-prior 

3.1 The asymptotic behavior of the prior density of /3 

As introduced in Section 2, the probability distribution of /? given (cr^ and 
g) and g are provided by (fTTj) and f|T^ respectively. Hence the probability 
density of j3 given cr^ is provided by 

p{(3\a') = J p{W'[3\g,a')p,{g)dg (29) 
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where p{W'f3\g, o"^) is the probabihty density of multivariate normal Nr{0, cr^\E'~^) 
where \E' = diag((i^{z/i + uig — . . . , d'^ii^r + ^9 — In this section, 

we consider the asymptotic behavior of the density p(/3|cr^) as ||/3|| — > oo. 
We easily see that 

< ^ < (30) 



{ui + l)g Vi + UiQ-l Urg'' 

for g > vi and any i. The Tauberian Theorem, which is the well-known 
theorem for describing the asymptotic behavior of the Laplace transform, 
tells us that the contribution of integrals around zero is negligible when we 
consider the asymptotic behavior as \\(3\\ oo. So we have only to consider 
the integration between ui and oo (the major term). Using the relation (130|) . 
we have 

A, r f^v"" fiy'"""exp / 



< the major term of p(/3|o"^) 



where A = {B{a + 1, 6 + 1)} ^(27ro"^) From the Tauberian theorem, we 
see that there exist Mi < M2 such that 

||^||r+2a+2 

Ml < ^^p(/5|a^) < M, (31) 

for sufficiently large /3. 

By (pTI) . smaller a means flatter tail behavior and hence objectivity of 
prior information of f3. Note that for a = —1/2 the asymptotic tail behavior 
of p(/3|cr^) corres ponds to that of r nultiy ariate Cauchy distribution which is 



recommended in lZellner and Siowl (Il980l ). So asymptotic tail behavior of our 



choice a = —3/4 in fl26l) . ^ is flatter than that of the multivariate 
Cauchy distribution, H'^n-^'^i 



3.2 0(n) selection 

In fl3Tl) . we see that the choice of b does not depend on the order of asymptotic 
behavior of the density of (3. Originally b was supposed to be chosen to derive 
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a simple form of MD[A1^] in fl23p . In addition to the role for simplicity, the 
choice of b resolves so called 0{n) selection of g. The importance of 0{n) 
selection has been pointed out in the context of the choice of fixed g. Since 
increasing the data with a fixed choice of g would decrease the prior variance 
on the coefficients. Indeed, from (fTTj) . we have 



tr{Var(/3|^,a^)} = a^X^ '^- + |^f \ 

1=1 * 

Since ci^ = 0(n) by Theorem EH tr{Var(/5|^, a^)} = gO{n-^) if = 0(1). 
Therefore the choice g = 0{n) will prevent the prior from asymptotically 
dominating the likelihood. That is why Zellner (1986) and others have rec- 
ommended choosing g = 0{n) in g-phois. 

In our case, g is assumed to be a random variable. Since the choice 
a = —3/4 means flatter tail, E[g] does not exist. So, as alternatives, the 
mode of g and {£'[(7^^]}^^ are considered. We easily see that 

(The mode of q) = = — 

^ ^ a+2 5 

1 h 

2{n — q^) — 7. 



E[g-^] a + l 
Hence our prior information of g is averagely 0{n). 



4 Estimation after selection and BMA 

In this section, we assume that the submodel M.^ has been already selected. 
Usually the posterior mean, E[[3^\y\, is used as an estimator of [3^. This is 
regarded as the Bayes estimator under the loss function [5 — [3)'Q{5 — (3) for 
any positive-definite matrix Q. It is not however scaled by the variance cx^. In 
the area of Bayesian estimation theory, the scaled loss function {6 — j3)'Q{6 — 
/3)/cr^ is also used, where the Bayes estimator becomes Ela'"^ P\y]/ E[a~'^\y]. 
Here we consider the latter Bayes estimator because it has a simple closed 
form from the similar calculation as in Section [2l 
Notice that 

= WqWo = WW' + W#W'^ 
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and hence P = WP^ + W^P^. By the calculation in (ITS]) and E[P^\y] = 0, it 
is easy to see that 



_ ^ w.u[v f 1 E[a-\1 + g)-^y] . 



i=l 



Before we go further more concrete form of E[(7~'^{1 + g)~^\y]/ E[cr~'^\y], 
we consider the relationship between the estimator fl32|) and the usual least 
squares estimator. When < n — 2, the least squares estimator of /5 can be 
written as 



9 



PLSE = {Z',Z,rZ'^v = J2^- (33) 

1=1 

When > n — 1, the default estimator is 

1 / 

Pt7E = Z-v = J2^ (34) 

i=i 

where Z~ is the Moore- Pennrose inverse matrix Z^. So (1331) and (J34l) have the 
same expression with respect to the singular value decomposition. Clearly, 
in both cases, 

fj = {u[v/di, u'.v/drY ~ NriW'p, a^diag{d^^, df)), 

where r = min(n — l,^^). Hence our Bayes estimator given by (132|) is very 
reasonable in the sense that the components of fj w ith larger var i ances get 
shrunk more because vi > ■ ■ ■ > Vr- See Chapter 3 of iHastie et al.l (120011 ) for 
the detail of such a reasonable effect. 

On the other hand, the original Zellner's (7-prior cannot make a reasonable 
effect. In fact, when q < n — 2 and ui = ■ ■ ■ = Ur = 1 which implies that 
the prior given by (|TT1) reduces to the original Zellner's (7-prior, the estimator 
f l32|) becomes just a simple shrinkage estimator 



E[a~\l + g)-'\y] 
E[a-^\y] 



P- 



LSE- 
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Now we consider the concrete form of -E'[(T "^{l+g) ^\y]/E[a ^|?/]. In the 
similar way as in in Section [21 we have 

E[a-^il + gy'\y] 

+ ^)-/2+(n+l)/2|(^ + 1)RSS, + GESS,}-("+l)/2p,(^)rf^ • 

When q-y < n — 2 (which imphes r = g^) and b = (n — 5)/2 — r/2 — a, which 
is the same choice of b as in Section [21 we have 

E[a-\1 + g)-'\y] 
E[a'^\y] 

ira'iig + i)RSS^ + GESS^}-("+i)/2rf^ 



(7^(1 + ^?){(^? + 1)RSS, + GESS,}-("+i)/2t;^ 

r^?''+H(^7+i)Rss,+GEss,}-("+^)/^rf^ y' 

9'{{9+l)RSS^ + GESS^}-(-+^y^dg J 
B{r/2 + a + l,b + 2y 



1 + l^iy) 



B{r/2 + a + 2,b + l] 
n — — 3)/2 — 



= l + Uy)- 

V q^/2 + a + 1 

where a = —3/4 and 

, . RSS^ -|- GESS^ 

i-riy) = • 

On the other hand, when r = n — 1, RSS^ = and hence 
E[a-'{l+g)-'\y] 



E[a-^\y] 



+ g)p,{g)dg^ = {1 + E[g]}-\ (35) 



Hence we have to choose the mean of prior density of g for estimation of (3 
although we do not need any assumption of prior density of g at all for model 
selection. It is a good idea to let E[g] = d'^_i/d1 where dl/d'^_^ is known as 
the condition number of the linear equation. If d\/d'^_^ is extremely large 
which means the variance of the component fj^-i is also extremely large, 
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the coefScient of f}„_i of our estimator, d^_'^l{d\ + c?ii), is quite small. 
See ICasellal (119851 ) and iMaruyama and StrawdermanI (120051 ) for the detail of 
condition number. 

In summary, our estimator after model selection is 



(3, 



97 

E 

i=l 
n-l 



di 



1 



+ 1/4 



and = y — x-yP-y, where 



|(l + ,,„)"/2-,,/2-3/4^-'■ 

"l"n-l 

dKdl + dU) 



if q-y < n — 2 
if > n — 1 



When a new input variable Xq = (xqi, . . . , Xop)' comes, the predictive 
value of y at Xo, is given by 

Therefore from the Bayesian Model Averaging view point, the predictive 
value of y at xq is given by 



y + 



J2^C[My]iX0y-XyyPy 

EmMy] 



5 Model selection consistency 

The posterior consistency for model choice is 

plim^p{A4y\y) = 1 when Aiy is the true model, 

where plim denotes convergence in probability and the probability distribu- 
tion here is the sampling distribution under the true model Aiy. In this 
section, we show that our criterion of general form, MD[A1^] given by (125!) 
has a model selection consistency. The consistency property is clearly equiv- 
alent to 



plim, 



MD[My 



VTWy ^ My. 



(36) 



19 



Now we assume that for any model Ai-y, there exists a positive definite matrix 



such that 



limlz'^Z^ = H^. (37) 



n 



This is the standard assumption which appears in lKnight and Ful (120001 ) and 



Zod (120061 ) ■ Under the reasonable assumption (|37|) . we have some preliminary 



results for proving the consistency. 

Theorem 5.1. 1. Let have a spectral decomposition 

H^ = WAie^g{h,...,hjW:, 



where hi > ■ ■ ■ > hq^. Then \imn^aodi/n = hi and \imn-,oo d'j / dj = 
hi/hj for any i and j, where di > ■ ■ ■ > dq^ are the singular values of 

2. For the projection matrix onto the span of Zy, Py = Zy{ZyZy)~^Zy, 

iim^z;(/-Py)z^ = g[7,y] 

where Q[7,7'] is some x symmetric matrix. Further if the model 
M.y contains M.-^, Q[7,7'] becomes q^ x q^ zero matrix. If the model 
M.y does not contain Aij, <5[7,7'] is positive definite. 

3. Under the sampling model M.^, 

Plim^oo^ = <y' + /5;Q[7, (38) 
The main theorem on the consistency is as follows. 

Theorem 5.2. Assume (!37|) holds and vi is hounded from the above. Then 
the criterion MD[M.^\ is consistent for model selection. 

Proof. Note that 

vih'v < RSS^ + GESS^ < v'v 

and 

9 



=1 
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Then there exist Mi (7) < M2{'~f) which do not depend on n such that 

for sufficiently large n. Therefore, in order to show the consistency, it suffices 
to show that 

plim™n(^-V)/^(|g^)" = (39) 

because v'v/n, RSS^/ra and RSSy/n go to some constants respectively in 
probability by 

Consider the following two situations: 

1. M.y 2 TVly: Applying ( l38|) . RSS^/RSSy approaches 

aV(a2 + /?;Q[7,7']/57) 

which is strictly less than 1. Hence (RSS^/RSS^/)" converges to zero in 
probability exponentially fast with respect to n. Therefore, no matter 
what value — takes, ( l39i) is satisfied. 



2. MyCMy. Bv the result in lFernandez et all (120011). fRSS^./RSSy)"/^ 
converges in distribution to exp(x^ ,„g^/2). Since gy > q^, ( l39l) is sat- 
isfied. 

□ 



6 Simulation Study 

In this section, we compare numerical performance of our BC[M^] with com- 
petitors including AIC, BIG. We prepare 16 possible regressors {p = 16) as 
follows 

cor=o.9 cor=o.5 cor=o.i 

^TT^, 3:3, X4 , X7,xs N{0,1) 

cor=-o.7 cor=-o.3 
xn, X12 ~ N{0, 1), xi3, ~ U (-1, 1), ~ N{0, l),xie ~ U{-1, 1). 
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Here cor denotes the correlation of two normal random variables and U{a, h) 
denotes the uniform distribution between (a, h). And also (xi, X2), (X3, X4), (xs 
(0:7, xs), (xg, xio), xii, a;i2, xia, xm, xis, xie are independent. Since (xi, X2) and 
(x3, X4) are highly correlated, it may not be easy to detect it if the true model 
includes it. 

First we assume 14 true regressors xi, X2, . . . , X13, X14, that is, the true 
model is 
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Y = 1 + 2 Xi + normal error term. 



i=l 



We also assume that the sample size is n = 30, which means n > p case. 
The definition of competitors AIC, AICc ( Hurvich and Tsai ( 1989I )). BIG, 
MD are as follows 



n 



AIC = —2 X maximum log likelihood + 2(g + 2) 

AICc = —2 X maximum log likelihood + 2(g + 2) 

n — q — 6 

BIC = —2 X maximum log likelihood + q log n 

(RSS^)-('^-''-^)/2+3/4 B{q^/2 + 1/4, {n - q^)/2 - 3/4) 



MD 



(^/^)g,/2+i/4 5(1/4, (n - q^)/2 - 3/4) 



MD is given by MD[A1^] with a = —3/4, ui = ■ ■ ■ = Ug = 1 (which means 
naive Zellner's ^^-prior). Number of replication is = 200. The following 
table shows that how often the true model is the top 3 among 2^^ candidates. 





1st 


2nd 


3rd 


4th-last 


BC 


0.71 


0.15 


0.04 


0.10 


MD 


0.47 


0.20 


0.12 


0.31 


AIC 


0.38 


0.18 


0.20 


0.24 


AICc 


0.44 


0.20 


0.11 


0.25 


BIC 


0.54 


0.16 


0.16 


0.14 



Therefore BC has the best performance and we see that the extention of 
Zellner's ^f-prior with descending u is quite effective. 

Secondary we assume that the smaller model is true, 8 true regressors 
a^i, X2, X5, xq, xg, xio, xii, X13. The frequency of selection of the true model is 
as follows. Again our BC keep the best position while traditional criteria go 
down rapidly. 
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1st 


2nd 


3rd 


4th-last 


BC 


0.68 


0.15 


0.05 


0.12 


MD 


0.64 


0.12 


0.08 


0.16 


AIC 


0.08 


0.03 


0.02 


0.87 


AICc 


0.50 


0.14 


0.11 


0.25 


BIG 


0.26 


0.13 


0.07 


0.54 



Finally we again assume 14 true regressors Xi,X2, ■ ■ ■ , X13, X14 but the very 
small sample size n = 12. Actually this is a very difficult situation and there 
is no competitors in AIC, BIC, AICc, MD because of p > n. Since our 
BC could not unfortunately assign the true model as the top once among 
N = 200, alternatively, we consider the average of rank of each sub-models. 
As in table the true model is the top with respect to the average of ranks 
and hence we see that our criterion has an ability to find a true model. 
Interestingly we also observe the "grouping effect" , that is, if the variables 
are strongly correlated, the submodel including both of them is preferable to 
the submodel including one of them. 





1 


2 


3 


4 


5 


6 


7 


8 


9 


10 


11 


12 


13 


14 


15 


16 


av.rank 


1st 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


F 


F 


2101 


2nd 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


F 


T 


2460 


3rd 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


F 


2611 


4th 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


2898 


5th 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


F 


T 


F 


T 


3306 


25th 


T 


F 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


F 


F 


4758 


26th 


F 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


T 


F 


F 


4796 
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